Libraries

First we import the libraries that we will need for this project. We will be using MASS to simulate data from a negative binomial distribution and rstan for the HMC implementation.

require("MASS")
require("rstan")
options(mc.cores = parallel::detectCores())

Data

Next we load data for the project. Since we are analyzing survey responses about names and occupations, there are three sets of data. Firstly, we have the survey responses themselves, collected from the Omni study. Then, for each name asked about in the Omni study, we have its age-sex distribution, estimated from the social security administration’s records of everyone born in the last 100 years, and their names at date of birth. Lastly, for each occupation asked about in the Omni study, we have its age-sex distribution, estimated from a separate survey asking respondents of various ages and genders about their occupations.

source("src/load_data/surveys.R")
source("src/load_data/names.R")
source("src/load_data/occs.R")

With the data imported, we also load the functions that will help us simulate the data and plot the fitted models easily.

source("src/funcs/mix_kernel.R")

Model

The goal of the project is to model \(\bf{\mu_{ik}}\), the expected number of alters in a subpopulation \(\bf{\mathcal{G}_k}\) that an ego of age \(\bf{a_i}\) and gender \(\bf{g_i}\) knows. Letting \(\bf{d_i}\) denote ego \(i\)’s degree, \(\bf{\{i \to j\}}\) denote the event that ego \(i\) knows alter \(j\), \(\bf{g_j} \in \{M, F\}\) denote the gender of alter \(j\), and \(\bf{a_j} \in (0,100)\) denote the age of alter \(j\), the quantity can be expressed as: \[\begin{align} \mu_{ik} &= \mathbb{E} \biggl[ \sum_{j=1}^{d_i} \mathbb{1}\{ j \in \mathcal{G}_k\} \biggr] && \\ &= \sum_{j=1}^{d_i} p( j \in \mathcal{G}_k | a_i, g_i, i \to j) && \\\nonumber &= d_i p( j \in \mathcal{G}_k | a_i, g_i, i \to j) && \\\nonumber &= d_i \sum_{g_j} \int_{a_j} p( j \in \mathcal{G}_k, a_j, g_j | a_i, g_i, i \to j) da_j, \end{align}\]

where we use the additional data we have on the age and gender distributions of \(\mathcal{G}_k\) to integrate over alter age and gender.

Kernel Framework

The expression above can be simplified to a closed form solution, but before we do that, it’s important to construct our novel framework of modeling the demographic distribution of an ego’s newtork.

[1] We use a kernel to model the age distribution \(\bf{p(a_j)}\) of alters with gender \(\bf{g_j}\) known by an ego of age \(\bf{a_i}\) and gender \(\bf{g_j}\):

\[\begin{align} p(a_j | g_j, a_i, g_i, i \to j) &= \frac{ 1 }{ \sqrt{ 2\pi\lambda_{g_ig_j} } } \exp\biggl\{ -\frac{ (a_i - a_j)^2 }{ 2\lambda_{g_ig_j} } \biggr\} && \\\nonumber &= \mathcal{N}(a_j | a_i, \lambda_{g_ig_j}), \end{align}\]

where \(\bf{\lambda_{g_ig_j}}\) is the kernel bandwidth parameter specific to the ego and alter’s genders. This is also called a mixing kernel.

[2] We model the gender distribution of alters known by the ego as elements of a 4-by-4 matrix \(\bf{\rho}\): \[ p(g_j | a_i, g_i, i \to j) = p(g_j | g_i, i \to j) = \rho_{g_ig_j}, \] where the rows of the gender mixing are constrained \(\sum_{g_j} \rho_{g_ig_j} = 1\).

Thus, combining [1] and [2], the alter demographic distribution of an ego’s network can be modeled as: \[\begin{align} p(a_j, g_j | a_i, g_i, i \to j) &= p(g_j | a_i, g_i, i \to j) p(a_j | g_j, a_i, g_i, i \to j) && \\\nonumber &= \rho_{g_ig_j} \mathcal{N}(a_j | a_i, \lambda_{g_ig_j}). \end{align}\]

Simplification

With the above terms defined, the expected number of alters known group \(\mathcal{G}_k\) can be simplified as: \[\begin{align} \mu_{ik} &= d_i \sum_{g_j} \int_{a_j} p( j \in \mathcal{G}_k, a_j, g_j | a_i, g_i, i \to j) da_j && \label{eq:nonrandom_mixing_kernel_derivation} \\\nonumber &= d_i \sum_{g_j} \int_{a_j} p( j \in \mathcal{G}_k | a_j, g_j) p(a_j, g_j | a_i, g_i, i \to j) && \\\nonumber &= d_i \sum_{g_j} \int_{a_j} p( j \in \mathcal{G}_k | a_j, g_j) \rho_{g_ig_j} \mathcal{N}(a_j | a_i, \lambda_{g_ig_j}) && \\\nonumber &= d_i \sum_{g_j} \rho_{g_ig_j} \int_{a_j} p( j \in \mathcal{G}_k | a_j, g_j) \mathcal{N}(a_j | a_i, \lambda_{g_ig_j}) && \\\nonumber &= d_i \sum_{g_j} \rho_{g_ig_j} \int_{a_j} \biggl( \frac{ \int_{a} p( j \in \mathcal{G}_k | a, g_j) da }{ \int_{a} p( j \in \mathcal{G}_k | a, g_j) da } \biggr) p( j \in \mathcal{G}_k | a_j, g_j) \mathcal{N}(a_j | a_i, \lambda_{g_ig_j}) da_j && \\\nonumber &= d_i \sum_{g_j} \rho_{g_ig_j} \biggl( \int_{a} p( j \in \mathcal{G}_k | a, g_j) da \biggr) \int_{a_j} \biggl( \frac{ p( j \in \mathcal{G}_k | a_j, g_j) }{ \int_{a} p( j \in \mathcal{G}_k | a, g_j) da } \biggr) \mathcal{N}(a_j | a_i, \lambda_{g_ig_j}) da_j && \\\nonumber &\approx d_i \sum_{g_j} \rho_{g_ig_j} \biggl( \sum_\alpha \mathbb{P}( j \in \mathcal{G}_k | \alpha, g_j) \biggr) \int_{a_j} \mathcal{N}(a_j | \mu_{kg_j}, \sigma_{kg_j}^2) \mathcal{N}(a_j | a_i, \lambda_{g_ig_j}) da_j && \\\nonumber &= d_i \sum_{g_j} \rho_{g_ig_j} \biggl( \sum_{a_j} \frac{ N_{k, a_j, g_j} }{ N_{a_j, g_j} } \biggr) \mathcal{N}(a_i | \mu_{kg_j}, \lambda_{g_ig_j} + \sigma_{kg_j}^2) && \\\nonumber &= d_i \sum_{g_j} \rho_{g_ig_j} \biggl( \sum_{a_j} \frac{ N_{k, a_j, g_j} }{ N_{a_j, g_j} } \biggr) \frac{ e^{ -\frac{ (a_i - \mu_{kg_j})^2 }{ 2(\lambda_{g_ig_j} + \sigma_{kg_j}^2) } } }{ \sqrt{ 2\pi(\lambda_{g_ig_j} + \sigma_{kg_j}^2) } }, \end{align}\]

where we approximate the normalized (with respect to age) distribution \(\frac{ p( j \in \mathcal{G}_k | a_j, g_j) }{ \int_{a} p( j \in \mathcal{G}_k | a, g_j) da }\) by the normal distribution \(\mathcal{N}(a_j | \mu_{kg_j}, \sigma_{kg_j}^2)\). The approximating normal’s moments, \(\mu_{kg_j}\) and \(\sigma_{kg_j}^2\), are estimated from \(\mathcal{G}_k\)’s population data, and are analogous to the mean and standard deviation of the age distribution of individuals in \(\mathcal{G}_k\) with gender \(g_j\). Additionally, since survey respondents’ ages are generally observed discretely, we approximate the integral \(\int_{a} p( j \in \mathcal{G}_k | a, g_j) da\) by the summation \(\sum_\alpha \mathbb{P}( j \in \mathcal{G}_k | \alpha, g_j)\) over discrete age \(\alpha \in \{0,1,2,..\}\).

Spline Framework

Note that the formulation in [1] above implies that there are four bandwidth values \(\lambda_{g_ig_j}\). This implies that the variance of an ego’s alter age distribution depends on the ego’s gender but is independent of the ego’s age. To add a dependence on ego age, we replace the bandwidth with a functional bandwidth \(\bf{\lambda_{g_ig_j}(a_i)}\). To keep the overall model tractable, we further impose structure on the functional bandwidth by setting it to a fourth order B-spline. We set the knots \(t_q\) of the spline to the deciles of the population age distribution (i.e. 11 knots). Hence, the number of basis functions \(N\) is 13 (= 11 + 4 - 2), and the functional bandwidth becomes: \[ \lambda_{g_ig_j}(a_i) = a_0^{g_ig_j} + \sum_{i=1}^N a_i^{g_ig_j} B_{i,k}(x) \] where \(B_{i,k}(x)\) are the recursively defined spline basis functions.

Since this functional bandwidth does not depend on \(a_j\), it is a constant with respect to the integrals in the Simplification section. Hence, the simplication is updated to: \[\mu_{ik} = d_i \sum_{g_j} \rho_{g_ig_j} \biggl( \sum_{a_j} \frac{ N_{k, a_j, g_j} }{ N_{a_j, g_j} } \biggr) \frac{ e^{ -\frac{ (a_i - \mu_{kg_j})^2 }{ 2(\lambda_{g_ig_j}(a_i) + \sigma_{kg_j}^2) } } }{ \sqrt{ 2\pi(\lambda_{g_ig_j}(a_i) + \sigma_{kg_j}^2) } }\]

Stan

We can encapsulate the above framework into the following Stan model:

functions {
  // Returns the integration of the product of two normals
  real norm_prod_int(int mu1, real mu2, real var1, real var2) {
    real K;
    K = 1/sqrt(2 * 3.141593 * (var1 + var2)) * exp(-(mu1 - mu2)^2/(2 * (var1 + var2)));
    return K;
  }
  
  // Builds the B spline with specified parameters
  vector build_b_spline(real[] t, real[] ext_knots, int ind, int order);
  vector build_b_spline(real[] t, real[] ext_knots, int ind, int order) {
    // INPUTS:
    //    t:          the points at which the b_spline is calculated
    //    ext_knots:  the set of extended knots
    //    ind:        the index of the b_spline
    //    order:      the order of the b-spline
    vector[size(t)] b_spline;
    vector[size(t)] w1;
    vector[size(t)] w2;
    w1 = rep_vector(0, size(t));
    w2 = rep_vector(0, size(t));
    if (order==1) {
      for (i in 1:size(t)) // B-splines of order 1 are piece-wise constant
        b_spline[i] = (ext_knots[ind] <= t[i]) && (t[i] < ext_knots[ind+1]);
    }
    else {
      if (ext_knots[ind] != ext_knots[ind+order-1])
        w1 = (to_vector(t) - rep_vector(ext_knots[ind], size(t))) /
             (ext_knots[ind+order-1] - ext_knots[ind]);
      if (ext_knots[ind+1] != ext_knots[ind+order])
        w2 = 1 - (to_vector(t) - rep_vector(ext_knots[ind+1], size(t))) /
                 (ext_knots[ind+order] - ext_knots[ind+1]);
      // Calculating the B-spline recursively as linear interpolation of two lower-order splines
      b_spline = w1 .* build_b_spline(t, ext_knots, ind, order-1) +
                 w2 .* build_b_spline(t, ext_knots, ind+1, order-1);
    }
    return b_spline;
  }
}

data {
  int<lower=0> K_1;                       // number of names
  int<lower=0> K_2;                       // number of occupations
  int<lower=1> N;                         // number of respondents
  int<lower=1> N_K;                       // number of knots for the bandwidth spline
  int<lower=1> N_S;                       // number of grid points to evaluate spline on
  int<lower=1> D;                         // degree of bandwidth spline (order - 1)
  real age_mean;                          // average age

  int Y[N,K_1+K_2];                       // responses to how many X do you know
  real w[N];                              // response survey weights
  int age[N];                             // age of respondents
  int<lower=1> g_n[N];                    // gender of respondents: 1=female 2=male
  int<lower=1> g_k[K_1+K_2*2];            // genders of names: 1=female 2=male (names have one each, but occs have two genders each)

  real<lower=0> mu_k[K_1+K_2*2];          // age mean for each gender-group k
  real<lower=0> sigma_k[K_1+K_2*2];       // age standard deviation for each gender-group k
  real<lower=0> sum_prob_k[K_1+K_2*2];    // sum population proportion of each gender-group k
  
  vector[N_K] knots;                      // sequence of knots for the bandwidth spline
  real X[N_S];                            // age grid over which to evaluate spline

  real mu_d;                              // prior mean for the log_degrees
  real<lower=0> sigma_d;                  // prior sd for the log_degrees
  real<lower=0> alpha_omega;              // prior alpha for inv_omega
  real<lower=0> beta_omega;               // prior beta for inv_omega
  vector<lower=0>[2] alpha_rho;           // prior alpha for gender mixing rows
  vector[3] mu_beta;                      // prior mean for beta
  vector<lower=0>[3] sigma_beta;          // prior sd for beta

  real recall_power;                      // 0 = no recall adjustment, 0.5 = recall adjustment
  real degree_regression;                 // 0 = simple prior, 1 = agesex regression prior
}

transformed data {
  int K;                                  // total number of groups
  int N_B;                                // total number of B-splines              
  matrix[N_K + D - 1, N_S] B;             // matrix of B-splines
  vector[D + N_K] ext_knots_temp;
  vector[2*D + N_K] ext_knots;            // set of extended knots
  K = K_1 + K_2;
  N_B = N_K + D - 1;    
  ext_knots_temp = append_row(rep_vector(knots[1], D), knots);
  ext_knots = append_row(ext_knots_temp, rep_vector(knots[N_K], D));
  for (ind in 1:N_B)
    B[ind,:] = to_row_vector(build_b_spline(X, to_array_1d(ext_knots), ind, D + 1));
  B[N_B, N_S] = 1;
}

parameters {
  vector[N] log_d;                        // log respondent degrees
  real<lower=0,upper=1> inv_omega[K];     // inverse group overdispersions
  simplex[2] rho[2];                      // gender mixing rates
  vector[3] beta;                         // degree regression coefs (third is logged)
  real log_eta;                           // log sd of degree
  
  row_vector[N_B] a_raw[2,2];             // raw spline coefficients
  real a0[2,2];                           // spline intercept
  real<lower=0> tau[2,2];                 // spline coefficient prior sd
}

transformed parameters {
  real<lower=0> d[N];                     // respondent degrees
  real<lower=0> omega[K];                 // group overdispersions
  real<lower=0> eta;                      // sd of degree
  row_vector[N_B] a[2,2];                 // spline coefficients
  vector[N_S] lambda[2,2];                // spline evaluated at the age grid
  for(n in 1:N) {
    d[n] = exp(log_d[n]);
  }
  for(k in 1:K) {
    omega[k] = inv(inv(inv_omega[k]) - 1);
  }
  eta = exp(log_eta);
  for(i in 1:2) {
    for(j in 1:2) {
      a[i,j] = a_raw[i,j] * tau[i,j];
      lambda[i,j] = exp(a0[i,j] + to_vector(a[i,j] * B));
    }
  }
}

model {
  vector[N] ex_log_d;
  real mu_nk;
  int age_;

  // Parameter Priors
  for(p in 1:3) {
    beta[p] ~ normal(mu_beta[p], sigma_beta[p]);
  }
  log_eta ~ normal(-0.7, 0.1);
  inv_omega ~ beta(alpha_omega, beta_omega);
  for(i in 1:2) {
    rho[i] ~ dirichlet(alpha_rho);
    for(j in 1:2) {
      a_raw[i,j] ~ normal(0, 2);
      a0[i,j] ~ normal(0, 2);
      tau[i,j] ~ normal(0, 2);
    }
  }

  // Degree Priors
  if(degree_regression == 1) {
    for(n in 1:N) {
      ex_log_d[n] = beta[1] + 
                    beta[2] * (g_n[n]-1) - 
                    exp(beta[3]) * ((age[n] - age_mean)/age_mean)^2;
    }
    log_d ~ normal(ex_log_d, eta);
  }
  else {
    log_d ~ normal(mu_d, sigma_d);
  }

  // Likelihood
  for(k in 1:K) {
    for(n in 1:N) {
      if(Y[n,k] >= 0) {
        mu_nk = rho[g_n[n], g_k[k]] *  
                sum_prob_k[k] * 
                norm_prod_int(age[n], 
                              mu_k[k], 
                              lambda[g_n[n], g_k[k]][age[n]-17], 
                              sigma_k[k]^2);
        // For occupations, we add a term for other gender
        if(k > K_1) {
          mu_nk = mu_nk + rho[g_n[n], g_k[k+K_2]] *  
                          sum_prob_k[k+K_2] *
                          norm_prod_int(age[n], 
                                        mu_k[k+K_2], 
                                        lambda[g_n[n], g_k[k+K_2]][age[n]-17], 
                                        sigma_k[k+K_2]^2);
        }
        if(recall_power > 0) {
          mu_nk = pow(mu_nk, 1 - recall_power) * exp((-6 - exp(-7)/mu_nk) * recall_power);
        }
        mu_nk = mu_nk * d[n];
        target += w[n] * neg_binomial_lpmf(Y[n,k] | omega[k] * mu_nk, omega[k]);
      }
    }
  }
}

If we save the above stan model into a separate file, it can then be compiled and stored in memory as follows:

model_kernel_spline <- stan_model(file = "src/stan/degree_kernel_spline.stan")

Results

We demonstrate how to run the fitting for the names-only model. The occupations-only and names-occupations-combined models can be run in a similar way via source("src/fit/kernel_spline_occ.R") and source("src/fit/kernel_spline_comb.R"), respectively.

Names Fitting

First, we prepare the parameter values for the B-spline:

age_grid <- seq(min(data$age), max(data$age), 1)
knots <- quantile(age_grid, probs = seq(0, 1, .1))
N_K <- length(knots)
D <- 3
N_B <- N_K + D - 1

Next, we provide some initial values for the parameters in order to speed up convergence:

init_data <- list()
for(i in 1:4) {
  init_data[[i]] <- list(log_d = rep(6, nrow(names_data)),
                         inv_omega = rep(5/6, ncol(names_data)),
                         a_raw = array(0, dim = c(2,2,N_B)),
                         a0 = matrix(0, nrow = 2, ncol = 2),
                         tau = matrix(0.5, nrow = 2, ncol = 2),
                         beta = c(6, 0.1, -3),
                         log_eta = 0)
}

Finally, we gather the data (and known prior parameters) that will be passed into Stan:

mcmc_data <- list(K_1 = ncol(names_data),
                  K_2 = 0,
                  N = nrow(names_data),  
                  KG = ncol(names_data),
                  N_K = N_K,
                  N_S = length(age_grid),
                  D = D,
                  age_mean = age_mean_2014, 
                  Y = as.matrix(names_data),  
                  w = data$wgt,
                  age = data$age, 
                  g_n = as.numeric(data$Gender), 
                  g_k = c(rep(1, 6), rep(2, 6)), 
                  mu_k = mu_k_name, 
                  sigma_k = sigma_k_name, 
                  sum_prob_k = sum_prob_k_name, 
                  knots = knots,
                  X = age_grid,
                  mu_d = 6, sigma_d = 0.6, 
                  alpha_omega = 4.5, beta_omega = 0.5, 
                  alpha_rho = c(5,5), 
                  mu_beta = c(6, 0.1, -3), sigma_beta = rep(1, 3),
                  recall_power = 0, 
                  degree_regression = 1)

Now we are ready to fit the model in Stan (which takes a couple hours) and extract the posterior draws of the parameters:

fit_names_kernel_spline <- sampling(model_kernel_spline, 
                                    data = mcmc_data, 
                                    iter = 2000,
                                    init = init_data)

Once the stan run has completed, we can extract the posterior means/medians of the parameters:

degree <- list()
degree$NameSexAgeKernelSpline <- apply(extract(fit_names_kernel_spline)$d,2,median)
rho_names_kernel_spline <- apply(extract(fit_names_kernel_spline)$rho, c(2,3), mean)
beta_names_kernel_spline <- colMeans(extract(fit_names_kernel_spline)$beta)
eta_names_kernel_spline <- mean(extract(fit_names_kernel_spline)$eta)

Names Results

[1] First we take a look at the fitted degree distributions for 6 different egos (young men/women, middle-aged men/women, and old men/women).

par(mfrow=c(2,3))
plot_degrees(degree$NameSexAgeKernelSpline, ego_sex_age, ego_names_verbose, data$wgt, rm = rm_na_names)

[2] Next we plot the kernel bandwidths as a function of ego/alter gender and ego age.

plot_kernel_spline(age_grid, fit_names_kernel_spline, overlay = T, ylim = c(18,55))

plot_kernel_spline(age_grid, fit_names_kernel_spline, ylim = c(18,55))

[3] Finally, we plot the alter demographic kernels for the 6 different egos from [1].

par(mfrow=c(3,2))
plot_kernels(fit_ = fit_names_kernel_spline, 
             rho_ = rho_names_kernel_spline,
             y_cent = 70,
             y_index = floor(70)-17,
             y_min = 0,
             y_max = 100)
plot_kernels(fit_ = fit_names_kernel_spline, 
             rho_ = rho_names_kernel_spline,
             y_cent = age_mean_2014,
             y_index = floor(age_mean_2014)-17,
             y_min = 0,
             y_max = 100)
plot_kernels(fit_ = fit_names_kernel_spline, 
             rho_ = rho_names_kernel_spline,
             y_cent = 21,
             y_index = floor(21)-17,
             y_min = 0,
             y_max = 100)